load dataset
Objective
This analysis investigates the Red Wine Dataset that is openly available on kaggle.com. According to its description, the dataset contains objective measurements of a number of parameters that contribute to the rating of the quality of wine. I will be using R to explore the contents of the dataset in an attempt to determine the relationship between its fields and wine quality. The variable quality will later form the response variable of our linear models. Additionally, a descriptive categorization of wine quality will be used to assist the analysis: “bad” refers to a quality value below 5, “average” to a value between 5 and 7 inclusive, and “good” to a value greater than 7.
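The three labels can be derived directly from the numeric score. The following is a minimal sketch in R, assuming numeric quality scores; the helper name rate_quality is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: map quality scores to the labels used in this report.
# Thresholds follow the text: below 5 = "bad", 5 to 7 = "average", above 7 = "good".
rate_quality <- function(quality) {
  rating <- ifelse(quality < 5, "bad",
                   ifelse(quality > 7, "good", "average"))
  # An ordered factor keeps the labels in a sensible order for tables and plots.
  factor(rating, levels = c("bad", "average", "good"), ordered = TRUE)
}

scores <- 3:8  # the range of quality values reported by summary()
data.frame(quality = scores, rating = rate_quality(scores))
```

Calling table(rate_quality(x)) on a quality column x would then give the number of wines in each category.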
## V1 fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## V1 fixed.acidity volatile.acidity citric.acid residual.sugar
## 1: 1 7.4 0.700 0.00 1.9
## 2: 2 7.8 0.880 0.00 2.6
## 3: 3 7.8 0.760 0.04 2.3
## 4: 4 11.2 0.280 0.56 1.9
## 5: 5 7.4 0.700 0.00 1.9
## ---
## 1595: 1595 6.2 0.600 0.08 2.0
## 1596: 1596 5.9 0.550 0.10 2.2
## 1597: 1597 6.3 0.510 0.13 2.3
## 1598: 1598 5.9 0.645 0.12 2.0
## 1599: 1599 6.0 0.310 0.47 3.6
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1: 0.076 11 34 0.99780 3.51
## 2: 0.098 25 67 0.99680 3.20
## 3: 0.092 15 54 0.99700 3.26
## 4: 0.075 17 60 0.99800 3.16
## 5: 0.076 11 34 0.99780 3.51
## ---
## 1595: 0.090 32 44 0.99490 3.45
## 1596: 0.062 39 51 0.99512 3.52
## 1597: 0.076 29 40 0.99574 3.42
## 1598: 0.075 32 44 0.99547 3.57
## 1599: 0.067 18 42 0.99549 3.39
## sulphates alcohol quality
## 1: 0.56 9.4 5
## 2: 0.68 9.8 5
## 3: 0.65 9.8 5
## 4: 0.58 9.8 6
## 5: 0.56 9.4 5
## ---
## 1595: 0.58 10.5 5
## 1596: 0.76 11.2 6
## 1597: 0.75 11.0 6
## 1598: 0.71 10.2 5
## 1599: 0.66 11.0 6
Summary and Structure of dataset
Upon loading the dataset we see that it contains 1,599 observations of 12 variables, plus a row index column (V1). I used the summary() and str() functions to get an idea of the layout of the dataset. The variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality. There were some interesting observations from inspecting the structure of the data: for example, the values of citric acid seem to be distributed oddly, with values jumping from 0.00 to 0.04 to 0.56. Additionally, variables like sulphates, alcohol and volatile.acidity seemed to be uniformly distributed. The summary() function gave a number of statistics for each variable, including the mean, median, min and max values. For example, fixed.acidity ranged from a min of 4.60 to a max of 15.90; note that fixed acidity is a concentration of acid, not a pH value, so these figures do not indicate whether the wine is acidic or basic. Meanwhile alcohol had a min of 8.4, a median of 10.2 and a max of 14.9. In order to check whether the impressions given by these functions are correct, I next turn my attention to generating univariate plots to better visualize the distribution of values within the dataset.
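As an illustration of the inspection step, the sketch below rebuilds the first five rows printed above (a subset of the columns only) and applies str() and summary() to them. The object name wine_sample is an assumption made for this example; the full dataset would be inspected the same way.

```r
# First five rows of the dataset as printed above (selected columns only).
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.700, 0.880, 0.760, 0.280, 0.700),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

dim(wine_sample)      # number of rows and columns
str(wine_sample)      # type and leading values of each column
summary(wine_sample)  # min, quartiles, median, mean and max per column
```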
Univariate plots
Histogram of Fixed Acidity
total.sulphur.dioxide, sulphates, free.sulphur.dioxide, fixed.acidity, residual.sugar, chlorides, citric.acidity, and alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution above is positively skewed, with a few outliers at the higher ranges.
Histogram of Volatile Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile acidity looks almost rectangular, which is peculiar, and there are a number of outliers with little data at the high ranges. The rectangular shape may indicate an error in the data collection or that the data is incomplete.
Histogram of Citric Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric acidity is also positively skewed, with a high peak below 0.2 and the majority of the data points distributed between 0.2 and 0.5.
Histogram of Residual Sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual sugar is positively skewed but is odd, with half of the values lying between 1.9 and 2.6 (the first and third quartiles) and very low frequency at the other values.
Histogram of Chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is similar to that of residual sugar, with the slight difference being that most of the values are concentrated around the mean of 0.087.
Histogram of Free Sulphur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free sulphur dioxide is positively skewed with a peak around 14 and few outliers in the high ranges
Histogram of Total Sulphur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Because total sulphur dioxide is a superset of free sulphur dioxide, it is not a surprise that it follows a similar distribution to that of free sulphur dioxide.
Histogram of Density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The histogram of density shows an approximately normal distribution.
Histogram of pH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Likewise, pH shows an approximately normal distribution.
Histogram of Sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is positively skewed, with most of the points lying below the mean value of 0.658 (the median is 0.62).
Histogram of Alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol also shows a positively skewed distribution
Histogram of Quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality shows a near normal distribution, although it is discrete: it takes only whole-number values between 3 and 8.
Univariate Plot
The hist() function was used to generate plots for the variables within the dataset; however, the variables had to be coerced using as.numeric() before being passed to hist(), which requires numeric input. Additionally, the axis label and colour of each plot were set using the xlab and col parameters, with the colour set to green. The histogram of quality highlighted that the majority of the wine samples within the dataset were of either bad or average quality, which may be the reason for the odd distributions exhibited by chlorides and residual sugar.
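A minimal sketch of the plotting call described above follows. It uses artificial, positively skewed values rather than the wine data, so the shape is only illustrative; the variable name x and the axis label are assumptions made for this example.

```r
# Artificial, positively skewed values (evenly spaced log-normal quantiles).
# This is NOT the wine data; it only stands in for a skewed variable.
x <- qlnorm(ppoints(500), meanlog = 0.9, sdlog = 0.5)

# Histogram with the same arguments the report mentions: xlab and col.
h <- hist(as.numeric(x),
          xlab = "Value (artificial data)",
          col  = "green",
          main = "Histogram of a positively skewed variable")

# In a positively skewed distribution the mean is pulled above the median.
summary(x)
mean(x) > median(x)
```

Comparing the mean with the median in the summary() output is a quick numerical check on the skew seen in a histogram.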
Initial thoughts and hypothesis
Upon doing some reading, the variables that are important to wine quality are alcohol and pH, which is consistent with the histograms obtained above. However, variables such as sulphates, fixed acidity, citric acidity, volatile acidity, free sulphur dioxide, and total sulphur dioxide may also play a role in wine quality and will be explored further via bivariate analysis and linear modelling. If pH is known to be an influential variable, there is some expectation that acidity (fixed acidity, citric acidity, volatile acidity) would also play a role in the quality of wine, since pH measures how acidic or basic a substance is. Additionally, since pH and density are the only variables, excluding quality, that exhibit a normal distribution, I will investigate whether there is a correlation between these two variables that explains the noted similarity.
Distribution and Outliers
1. Total Sulphur Dioxide, Sulphates, Free Sulphur Dioxide, Fixed Acidity, Residual Sugar, Chlorides, Citric Acidity, and Alcohol have a positively skewed distribution.
2. The histograms of both residual sugar and chlorides are striking because of the odd distribution of values and the existence of many outliers, particularly for values greater than 5.0 for residual sugar and greater than 0.2 for chlorides. For this reason these variables may be omitted later when I attempt to construct linear models.
3. Density, pH and quality displayed an approximately normal distribution.
Generation of Bivariate Plots
Wine Quality vs chlorides
## [1] -0.1289066
The minus sign indicates a negative correlation between chlorides content and quality. The correlation value is close to 0 and indicates that there is very little correlation between the two variables.
Wine Quality vs pH
## [1] -0.05773139
There is a negative correlation between wine quality and pH, and the two variables are only weakly correlated, with a very small correlation coefficient of -0.0577. The lower the pH (the more acidic), the better the quality of the wine; however, the coefficient is so small that we can consider pH to have little to no effect on quality for our given dataset.
Wine Quality vs Alcohol Content
## [1] 0.4761663
There is a moderate positive correlation between alcohol content and wine quality, with a coefficient of 0.4762, so wine of higher quality tends to have a higher alcohol content.
Wine Quality vs Density
## [1] -0.1749192
The negative correlation between wine quality and density tells us that wine quality increases as the density of the wine decreases.
pH vs Density
## [1] -0.3416993
pH and density have a moderate negative correlation, which indicates that as pH decreases (becomes more acidic) the density of the wine tends to increase. This may also help to explain the correlation between density and wine quality.
pH vs Alcohol Content
## [1] 0.2056325
pH and alcohol show a weak to moderate positive correlation, which suggests that the less acidic (higher pH) the wine sample, the more alcoholic it tends to be.
Alcohol Content vs Density
## [1] -0.4961798
There is a moderately strong negative correlation between density and alcohol content, which implies that the density of wine decreases as alcohol content increases. This result is consistent with the previously observed correlations between pH and alcohol content and between pH and density.
Wine Quality vs Fixed Acidity
## [1] 0.1240516
There is a weak positive correlation between fixed acidity and wine quality, which is expected from the insight obtained between pH and quality.
Wine Quality vs Volatile Acidity
## [1] -0.3905578
There is a surprisingly strong negative correlation between volatile acidity and wine quality, the second largest in magnitude after alcohol, which indicates that as volatile acidity decreases the quality of the wine increases.
Wine Quality vs Citric Acidity
## [1] 0.2263725
The correlation between citric acidity and wine quality is moderate and positive.
Wine Quality vs Sulphates
## [1] 0.2513971
There is also a moderate positive correlation between sulphates and wine quality which indicates that as sulphate content increases so too does wine quality.
Wine Quality vs Free Sulphur Dioxide
## [1] -0.05065606
Free sulphur dioxide is so weakly correlated with wine quality that it can be considered to have no effect on it: the correlation is negative, with a very small coefficient of -0.0507.
Wine Quality vs Total Sulphur Dioxide
## [1] -0.1851003
Since total sulphur dioxide is a superset of free sulphur dioxide, the weak negative correlation that is observed between it and wine quality is expected.
Fixed Acidity vs Volatile Acidity
## [1] -0.2561309
Fixed acidity and volatile acidity show a moderate negative correlation, which tells us that as volatile acidity increases the value of fixed acidity tends to decrease.
Citric Acidity vs Volatile Acidity
## [1] -0.5524957
Citric acidity and volatile acidity show a strong negative correlation, with a coefficient of -0.5525, which is the strongest correlation observed thus far.
Free Sulphur Dioxide vs Total Sulphur Dioxide
## [1] 0.6676665
It is not surprising that free sulphur dioxide and total sulphur dioxide show a strong positive correlation since the former is a subset of the latter.
Observations
1. The strongest correlations with quality were those of alcohol content (0.48), volatile acidity (-0.39) and sulphates (0.25). Among the other pairs, the strongest were free vs. total sulphur dioxide (0.67), citric vs. volatile acidity (-0.55), alcohol content vs. density (-0.50) and pH vs. density (-0.34).
2. Free sulphur dioxide and pH seem to have no effect on the quality of the wine.
3. Chlorides, pH, density, volatile acidity, free and total sulphur dioxide all show a negative correlation with wine quality.
4. Alcohol content, fixed acidity, citric acidity, and sulphates all show a positive correlation with wine quality.
5. Citric acidity and sulphates had moderate correlation values, whereas chlorides, density, fixed acidity, and total sulphur dioxide had a weak correlation with wine quality.
6. Better wines tend to be slightly more acidic, less dense and higher in alcohol content.
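To make the comparison of coefficients concrete, the sketch below collects the correlations with quality reported in the output above (rounded to four decimal places) and ranks them by magnitude. On the full dataset each value would come from a call such as cor(wine_qual$alcohol, wine_qual$quality); the vector name r_quality is an assumption made for this example.

```r
# Correlations with quality, copied from the cor() output above and rounded.
r_quality <- c(
  chlorides            = -0.1289,
  pH                   = -0.0577,
  alcohol              =  0.4762,
  density              = -0.1749,
  fixed.acidity        =  0.1241,
  volatile.acidity     = -0.3906,
  citric.acid          =  0.2264,
  sulphates            =  0.2514,
  free.sulfur.dioxide  = -0.0507,
  total.sulfur.dioxide = -0.1851
)

# Rank by absolute value: the sign gives direction, the magnitude gives strength.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
ranked

# Squaring r gives the share of variance in quality associated with each
# variable on its own (the R-squared of a one-predictor linear model).
round(ranked^2, 3)
```

Ranking by absolute value avoids mistaking a negative coefficient for a weak one.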
Special features
The things I found interesting were the correlations between density and pH and between alcohol content and pH, and also the lack of correlation between quality and residual sugar. The insignificance of the effect of pH on quality may be due to the fact that the pH range of the dataset was narrow, which suggests an incomplete dataset. A wider range of pH within the sample might have revealed the stronger correlation that was expected.
Summary of Bivariate Plots
The plot() function was used to generate bivariate plots of the variables of interest in order to determine whether there is any correlation between them. A summary of the inferences that can be derived from the plots follows.
The types of acidity were plotted against each other to determine the relationship between them and to decide which of the three to use in the final linear model. The plot of citric acidity vs. volatile acidity suggests that citric acid is a non-volatile acid and, though it wasn’t plotted, we would expect a positive correlation between fixed acidity and citric acidity. The plot of fixed acidity vs. volatile acidity, on the other hand, shows a nearly flat trend line, with the majority of the values of volatile acidity lying between 0.3 and 0.7 whilst those of fixed acidity lie between 6.0 and 10.4. The plot of pH vs. density showed the negative correlation that was suspected (-0.34): density tends to fall as pH rises. The plot of pH vs. alcohol content shows that as pH increases so too does the alcohol content; the interesting observation is that most pH values lie between 3.0 and 3.6, which is acidic.
The plot of pH vs. wine quality shows that as the pH decreases the quality of the wine tends to increase slightly, with a few outliers, particularly at a wine quality value of 8. The correlation observed between citric acidity and quality is expected from the observations of pH. The plots of quality vs. free and total sulphur dioxide show the interesting trend that very low concentrations of sulphur dioxide produce bad wine while very high concentrations produce wine of average quality; the concentration of sulphur dioxide therefore needs to be intermediate to produce good wine. The plot of wine quality vs. chlorides showed, as was expected, that there is little correlation between the quality of the wine and its chloride content, and that the amount of chloride remained roughly constant across the different quality values. The plot of wine quality vs. sulphates shows a similar trend to that of sulphur dioxide, in that good wine has an intermediate concentration of sulphates, between 0.5 and 1.0, with higher concentrations producing average wine. The plot of residual sugar was omitted and its trend is inferred from its similarity with chlorides in the univariate plots. The plots of pH vs. the various types of acidity were also omitted, since a linear correlation can be inferred from the other plots. From these observations we can proceed with carefully selecting the variables from which to construct our linear models.
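As a sketch of how one of these bivariate plots can be produced, the example below uses the free and total sulphur dioxide values from the ten rows printed in the dataset preview. Ten rows are far too few to reproduce the reported coefficient of 0.67, so this only illustrates the calls; the names so2, fit and subset_ok are assumptions made for this example.

```r
# Free and total sulphur dioxide from the ten rows printed in the preview
# (rows 1-5 and 1595-1599 of the dataset).
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 32, 39, 29, 32, 18),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 44, 51, 40, 44, 42)
)

# Scatter plot with a least-squares trend line.
plot(so2$free.sulfur.dioxide, so2$total.sulfur.dioxide,
     xlab = "Free sulphur dioxide", ylab = "Total sulphur dioxide")
fit <- lm(total.sulfur.dioxide ~ free.sulfur.dioxide, data = so2)
abline(fit, col = "red")

# Free sulphur dioxide is part of the total, so total should never be smaller.
subset_ok <- all(so2$total.sulfur.dioxide >= so2$free.sulfur.dioxide)
subset_ok

# Pearson correlation for this small sample only.
cor(so2$free.sulfur.dioxide, so2$total.sulfur.dioxide)
```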
Multivariate Plots
Fixed Acidity vs Citric Acidity
As suspected from the bivariate plots, we see that fixed acidity and citric acidity are positively correlated.
Volatile Acidity vs Citric Acidity
High citric acidity and low volatile acidity seem to produce better wines.
Alcohol vs pH
Again we see that low pH and high alcohol content correspond to better quality wines.
Alcohol vs Sulphates
There is a near constant trend line over all the wine quality values, which suggests that low concentrations of sulphates and high concentrations of alcohol contribute to better quality wine.
Alcohol vs Density
The negative correlation between alcohol and density is consistent with our previous assertion that it is the changes in alcohol content that are responsible for the changes in density.
Alcohol vs Residual Sugar
The general observation is that there seems to be no correlation between residual sugar and alcohol content, which is particularly true for average and good quality wine.
Alcohol vs Total Sulphur Dioxide
The observation is that low values of total sulphur dioxide together with high alcohol content result in better quality wines.
Alcohol vs Volatile Acidity
Low volatile acidity and higher alcohol content seem to produce better quality wines.
pH vs Density
Small values of density and more acidic (lower) values of pH seem to give better quality wines.
Summary of Multivariate Plots
The values of multiple variables were compared with the quality of wine using the ggplot() function, and the graphs are given above. As was suspected, the first plot highlights that the majority of the dataset consists of wine of average quality, with very few data points for poor and good quality. It seems likely that the correlation between density and quality is a result of the effect of changes in alcohol content. The comparison between alcohol content and residual sugar showed no correlation for most of the dataset; however, there is a positive correlation for poor quality wine. This suggests that poor wine has higher levels of residual sugar, which may explain its lower alcohol content and its low quality rating. With the insights we have obtained from the previous analyses, we can proceed to the construction of our linear models.
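The idea behind these plots, two variables on the axes and quality shown as colour, can be sketched as follows. The original plots were made with ggplot(); this example swaps in base R graphics so that it runs without extra packages, and it uses only the ten rows printed in the dataset preview. The colours and object names are assumptions made for this example.

```r
# Alcohol, density and quality from the ten rows printed in the preview
# (rows 1-5 and 1595-1599 of the dataset).
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 11.2, 11.0, 10.2, 11.0),
  density = c(0.99780, 0.99680, 0.99700, 0.99800, 0.99780,
              0.99490, 0.99512, 0.99574, 0.99547, 0.99549),
  quality = c(5, 5, 5, 6, 5, 5, 6, 6, 5, 6)
)

# Treat quality as a category and give each level its own colour.
q <- factor(wine_sample$quality)
palette_q <- c("steelblue", "darkorange")[seq_along(levels(q))]

plot(wine_sample$alcohol, wine_sample$density,
     col = palette_q[as.integer(q)], pch = 19,
     xlab = "Alcohol", ylab = "Density")
legend("topright", legend = levels(q), col = palette_q, pch = 19,
       title = "Quality")

# Correlation in this small sample (not the full-dataset value).
r_sample <- cor(wine_sample$alcohol, wine_sample$density)
r_sample
```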
Construction of Linear Models
Quality as a function of alcohol content
##
## Call:
## lm(formula = as.numeric(wine_qual$quality) ~ wine_qual$alcohol,
## data = wine_qual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## wine_qual$alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
Quality as a function of pH
##
## Call:
## lm(formula = as.numeric(wine_qual$quality) ~ wine_qual$pH, data = wine_qual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6817 -0.6394 0.3032 0.3878 2.4874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6359 0.4332 15.320 <2e-16 ***
## wine_qual$pH -0.3020 0.1307 -2.311 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8065 on 1597 degrees of freedom
## Multiple R-squared: 0.003333, Adjusted R-squared: 0.002709
## F-statistic: 5.34 on 1 and 1597 DF, p-value: 0.02096
Quality as a function of alcohol and pH
##
## Call:
## lm(formula = as.numeric(wine_qual$quality) ~ wine_qual$alcohol +
## wine_qual$pH, data = wine_qual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7153 -0.4066 -0.1105 0.5076 2.4584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.42581 0.38742 11.424 < 2e-16 ***
## wine_qual$alcohol 0.38617 0.01676 23.036 < 2e-16 ***
## wine_qual$pH -0.85011 0.11571 -7.347 3.23e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6989 on 1596 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.2511
## F-statistic: 268.9 on 2 and 1596 DF, p-value: < 2.2e-16
Quality as a function of alcohol, pH and sulphates
##
## Call:
## lm(formula = as.numeric(wine_qual$quality) ~ wine_qual$alcohol +
## wine_qual$pH + wine_qual$sulphates, data = wine_qual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78838 -0.38275 -0.09267 0.50333 2.17815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.34450 0.40087 8.343 < 2e-16 ***
## wine_qual$alcohol 0.36684 0.01658 22.130 < 2e-16 ***
## wine_qual$pH -0.63526 0.11619 -5.467 5.29e-08 ***
## wine_qual$sulphates 0.86808 0.10402 8.345 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6843 on 1595 degrees of freedom
## Multiple R-squared: 0.2833, Adjusted R-squared: 0.282
## F-statistic: 210.2 on 3 and 1595 DF, p-value: < 2.2e-16
Quality as a function of alcohol, pH, sulphates and total sulphur dioxide
##
## Call:
## lm(formula = as.numeric(wine_qual$quality) ~ wine_qual$alcohol +
## wine_qual$pH + wine_qual$sulphates + wine_qual$total.sulfur.dioxide,
## data = wine_qual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79142 -0.37584 -0.05736 0.47139 2.11430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.642841 0.402526 9.050 < 2e-16 ***
## wine_qual$alcohol 0.350005 0.016806 20.826 < 2e-16 ***
## wine_qual$pH -0.641768 0.115355 -5.563 3.10e-08 ***
## wine_qual$sulphates 0.898584 0.103451 8.686 < 2e-16 ***
## wine_qual$total.sulfur.dioxide -0.002612 0.000529 -4.936 8.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6793 on 1594 degrees of freedom
## Multiple R-squared: 0.2941, Adjusted R-squared: 0.2923
## F-statistic: 166 on 4 and 1594 DF, p-value: < 2.2e-16
Linear models
The lm() function was used to create models for the quality of wine as a function of the variables alcohol, pH, sulphates, and total sulphur dioxide, while the summary() function was used to view the results. The first model consisted only of wine quality as a function of alcohol content, with an intercept of 1.87497 and a slope of 0.36084; the R-squared value was 0.2267, which indicates that alcohol content accounts for 22.67% of the observed variation. This further supports the claim that alcohol content is the variable most strongly correlated with wine quality. The second model consisted of quality as a function of pH only, with an intercept of 6.6359, a slope of -0.3020 and an R-squared value of 0.003333, which indicates that pH alone isn’t sufficient to account for the observed variance in the dataset. The next model used both pH and alcohol as independent variables, which resulted in an R-squared value of 0.252, indicating that the addition of pH only marginally improves the account of the variance (by about 2.5 percentage points). The following model added sulphates, which resulted in an intercept of 3.34450 and an R-squared value of 0.2833; similar to the addition of pH, this accounted for about an additional 3% of the variance. The next model added total sulphur dioxide, which resulted in an intercept of 3.642841 and an R-squared value of 0.2941, thus accounting for only about 1% more of the observed variance. Based on the models we can conclude that the factor with the largest contribution to wine quality is the alcohol content. The low contribution of the other factors, and the small share of the variance accounted for by alcohol and the other factors combined (29%), may be due to the fact that the dataset is predominantly comprised of average quality wine; the statistical results might be improved if a more inclusive and complete dataset were used.
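The model-building steps can be sketched as below. The data here is simulated purely so that the example runs on its own; it is not the wine data and its coefficients are not the ones reported above. The sketch also writes the formula as quality ~ alcohol with a data argument, which is equivalent to the wine_qual$alcohol form shown in the output but works more cleanly with predict(). All object names are assumptions made for this example.

```r
# Simulated stand-in data (NOT the wine dataset), for illustration only.
set.seed(42)
n   <- 200
sim <- data.frame(alcohol = runif(n, 8.4, 14.9),
                  pH      = runif(n, 2.74, 4.01))
sim$quality <- 1.9 + 0.36 * sim$alcohol + rnorm(n, sd = 0.7)

# Nested models: alcohol only, then alcohol plus pH.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- lm(quality ~ alcohol + pH, data = sim)

# R-squared of each model and the gain from adding pH.
r2 <- c(alcohol = summary(m1)$r.squared,
        alcohol_pH = summary(m2)$r.squared)
r2
diff(r2)

# F-test of whether the extra predictor improves the fit.
anova(m1, m2)
```

Because the models are nested, R-squared can only stay the same or rise as predictors are added, which is why the adjusted R-squared and the anova() comparison are worth checking alongside it.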
Final Plots & Summary
From the linear models we notice that alcohol content and sulphate concentration made the largest contributions to wine quality. The correlation between pH and quality was also something that I would like to highlight, hence the three chosen plots.
Plot 1
Alcohol content vs Wine Quality
This plot indicates that wine quality increases as the alcohol percentage increases. Alcohol content accounts for 22% of the observed variance; however, the remainder of the variance is not well accounted for by the remaining variables, as was highlighted by the linear models. As mentioned before, the current dataset predominantly consists of average quality wine, and therefore a more complete dataset may be required to account for the unexplained variance.
Plot 2
Effect of Alcohol and Sulphates on Wine Quality
This plot reveals that the best quality wines have a high alcohol percentage and a low sulphate concentration. The slightly negative slopes indicate that in better quality wine the alcohol percentage is slightly greater than the sulphate concentration.
Plot 3
Influence of pH on Wine Quality
This plot also reveals that better quality wines have a lower pH; in particular, the pH range is acidic. A thing to note is that the pH range of the dataset is narrow: half of the values lie between 3.21 and 3.40, and the full range is 2.74 to 4.01.
Conclusion
The red wine dataset presents an interesting collection of data on which to conduct exploratory statistics using R as the tool for the analysis. It was discovered that the dataset was largely comprised of wine of average quality, which limited how much of the variance within the data the analysis could account for. Of the 12 variables within the dataset, it was discovered that two of them, chlorides and residual sugar, seemed to have no significant effect on the quality of the wine. It was also discovered that pH is negatively correlated with density and positively correlated with alcohol content, both of which were correlated with the quality of the wine. The most impactful variable was alcohol content, which accounted for 22% of the variance in the dataset, with the other significant variables (pH, sulphates, and total sulphur dioxide) together accounting for only a further 7%. In the future I would like to have a more complete dataset that is uniformly representative of the different qualities of wine, in order to account for the remaining 71% of the variance.
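The percentages quoted in this conclusion follow from the R-squared values reported for the four nested models. The short calculation below reproduces that arithmetic; the vector names are assumptions made for this example.

```r
# Multiple R-squared of the nested models, copied from the lm() output above.
r2 <- c("alcohol"      = 0.2267,
        "+ pH"         = 0.2520,
        "+ sulphates"  = 0.2833,
        "+ total SO2"  = 0.2941)

# Share of variance added by each successive predictor, in percent.
gain <- diff(c(0, r2))
round(100 * gain, 2)

# Combined contribution of pH, sulphates and total sulphur dioxide, in percent.
round(100 * (r2[["+ total SO2"]] - r2[["alcohol"]]), 2)

# Share of variance left unexplained by the largest model, in percent.
round(100 * (1 - r2[["+ total SO2"]]), 2)
```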